STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics

Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology and beyond.


Introduction
Multi-modal data, especially image-text pairs, has gained significant importance and popularity (Srinivasan et al., 2021; Schuhmann et al., 2022), driven by the recent success of multi-modal models such as Contrastive Language-Image Pre-Training (CLIP; Radford et al., 2021). Researchers have been leveraging these models and data across various fields due to their versatility. Initially, many image models were trained to predict a fixed set of predetermined object categories, limiting their generalizability to identify other visual objects or concepts. Learning directly from text descriptions about images provides a complementary and broader source of supervision, expanding the range of potential applications.
Histopathology plays a crucial role in medical diagnostics, focusing on the microscopic examination of tissue samples to detect diseases and guide treatment decisions (Bera et al., 2019). It helps identify cellular abnormalities, including cancerous cells, inflammation, and tissue degeneration (Reddy, 1996; Bera et al., 2019). The collection of image-text pair data for histopathology requires careful annotation of whole-slide images (Huang et al., 2023; Ikezogwo et al., 2024) to create large-scale datasets suitable for research, training, and diagnostic tool development. Recent efforts to collect and annotate histopathology slides have opened up new opportunities in this domain (Gamper et al., 2019; Graham et al., 2021; Amgad et al., 2019; Huang et al., 2023; Ikezogwo et al., 2024). These annotations include simple single labels, such as cell/nuclei types in the PanNuke (Gamper et al., 2019) and Lizard (Graham et al., 2021) datasets, and cancer regions in NuCLS (Amgad et al., 2019). They also extend to more complex natural language descriptions derived from social media sources such as Twitter or YouTube, as seen in datasets such as OpenPath (Huang et al., 2023) and Quilt-1M (Ikezogwo et al., 2024). Fine-tuning multi-modal models like CLIP with these diverse datasets has shown improved performance in various tasks, including tissue structure classification and image/text retrieval (Huang et al., 2023; Ikezogwo et al., 2024). By combining the capabilities of multi-modal models with detailed histopathology annotations, researchers can achieve greater accuracy and flexibility in medical image analysis. This advancement not only enhances the diagnostic process but also opens the door to new applications in the development of automated pathology tools and, more generally, in personalized medicine (Liu et al., 2019; Nikolov et al., 2021).
While advancements in image annotation have shifted from single labels to natural language descriptions, histopathology slides remain complex and contain a wealth of information that can be challenging to encapsulate in a limited amount of text (Radford et al., 2021; Chen and Zou, 2023). These large tissue slides often feature diverse tissue structures, making it difficult to accurately describe all aspects within a confined-length text. This complexity is further compounded in slides depicting certain diseases, where the focus tends to be on diseased regions, potentially overlooking healthy tissue areas. Randomly cropping images from these slides can lead to misinterpretation and incorrect annotations (Ciga et al., 2021). Histopathology slides are commonly stained with Hematoxylin and Eosin (H&E), revealing details like nuclei and stroma. However, much more biological information exists in these tissue samples, such as gene expression changes and cell-cell communication, which cannot be discerned through staining alone.
Gene expression, the process through which mRNA molecules are generated from the information encoded by the DNA of a gene, is pivotal in studying biological processes. Gene expression data can significantly enhance the annotation of histopathology images. For instance, cancer regions can be identified by the over-expression of specific genes like ERBB2 in human epidermal growth factor receptor 2 (HER2)-positive breast cancer (Andersson et al., 2021). Moreover, gene expression data can support various downstream analyses, such as deconvolution (Chen et al., 2022, 2023; Luo et al., 2024), which infers the proportion of different cell types in a sample, or clustering (Yuan et al., 2024; Hu et al., 2021; Luecken et al., 2022), which can reveal distinct cell/tissue types/states. These applications underscore the potential benefit of paired image and gene expression data.
Gene expression can be measured through several technologies. Bulk RNA-sequencing provides an average expression across large cell populations (Kukurba and Montgomery, 2015). Single-cell RNA sequencing allows for analysis at the individual cell level, enabling more detailed insights into cellular heterogeneity (Kukurba and Montgomery, 2015). However, it still loses the spatial context within the tissue, which is crucial for integrating gene expression data with pathology images to utilize multi-modal methods effectively. To address this need, we highlight spatial transcriptomics (ST) (Ståhl et al., 2016; Moses and Pachter, 2022), a technology that uniquely measures gene expression while preserving spatial information within the tissue (Figure 1a,b). To be more specific, ST can provide gene expression measurement for individual sub-tiles that altogether make up the whole tissue slide. ST has gained significant attention and popularity in recent years due to this unique ability to measure gene expression within spatial context (Ståhl et al., 2016). These ST technologies have revolutionized the way researchers study tissue, allowing for more in-depth analysis of spatial interactions within the tissue and insights into tissue organization and disease mechanisms (Moses and Pachter, 2022; Tian et al., 2023). A key advantage of ST is its ability to provide both high-resolution histopathology images and detailed whole-transcriptome data for each spatial coordinate within a large tissue image (Ståhl et al., 2016). This makes ST a perfect source for paired medical image and text datasets, offering a richer, more accurate annotation that addresses the limitations of over-simplified textual descriptions that typically focus solely on broad categories like cancer or non-cancer regions. By providing high-dimensional annotations for each sub-tile, ST enables a more comprehensive understanding of tissue granularity, facilitating studies of cell-cell communication, tissue architecture, and disease progression (Ståhl et al., 2016; Tian et al., 2023). Despite these advantages, existing datasets that combine pathology images with gene expression data are often limited in size and scope (Fan et al., 2020; Xu et al., 2022; Fan et al., 2023; Yuan et al.,
Vision-Language Pairs in Histopathology. Multiple histopathology image-text pair datasets have emerged, serving as a foundational resource for studying medical images. The ARCH dataset consists of 8,617 figure-caption pairs with histology or immunohistochemistry (IHC) images, curated from research publications (Gamper and Rajpoot, 2021). The OpenPath dataset offers a broader perspective, featuring 116,504 image-text pairs from Twitter posts across 32 pathology subspecialties, along with 59,869 image-text pairs from replies to popular tweets, and 32,041 additional image-text pairs scraped from the LAION dataset (Huang et al., 2023; Schuhmann et al., 2022). Quilt-1M, a combination of Quilt with datasets from other sources, represents one of the largest vision-language histopathology datasets to date, with over 1 million image-text samples (Ikezogwo et al., 2024). These datasets prove to be valuable resources for training and evaluating models that can understand and correlate textual information with histopathology images.
Spatial Omics Datasets. The rise of ST and spatial omics data has spurred the development of various datasets that focus on transcriptomics or other omics data in tissue samples. Notable databases include SpatialDB (Fan et al., 2020), STOmicsDB (Xu et al., 2022), SPASCER (Fan et al., 2023), SODB (Yuan et al., 2023), Aquila (Zheng et al., 2023), Museum of Spatial Transcriptomics (Moses and Pachter, 2022), SORC (Zhou et al., 2024), and SOAR (Li et al., 2022). These datasets focus primarily on gene expression data, providing researchers with a wealth of information about the spatial distribution of gene expression in tissue samples. However, there is currently a lack of datasets that provide paired image and gene expression data, which is crucial for bridging the gap between visual information and underlying transcriptomic profiles.
Representation Learning in Medical Imaging. Representation learning has made significant strides in medical imaging. Early models focused on predicting single values such as gene expression (He et al., 2020a) or survival outcome (Chen et al., 2021), while more recent approaches employ self-supervised learning (SSL) techniques to learn from unlabeled image data (Ikezogwo et al., 2022). Contrastive SSL models including PLIP (Huang et al., 2023), Quilt-Net (Ikezogwo et al., 2024) and CONCH (Lu et al., 2024), which use image and label annotation, have gained popularity, with models successfully trained on image-text pairs. However, text encoders are limited by token length, making it challenging to incorporate gene expression data. In the ST field, researchers have explored contrastive SSL for image-gene expression data or other modalities like gene expression paired with protein abundance (Zeng et al., 2023; Long et al., 2023; Yao et al., 2024). These models are typically trained on a single slide, constrained by the lack of large datasets that pair histopathology images with gene expression data, and the challenge of integrating gene expression across different datasets.
3 Curating STimage-1K4M: Overview
ST technologies can be broadly categorized into two main types: sequencing-based and imaging-based. Sequencing-based ST technology typically involves capturing spatial information using unique barcodes that correspond to specific regions, usually called "spots", within a tissue sample (see Figure 1a middle panel and Figure 1b for example). This approach enables researchers to capture the entire transcriptome while retaining the spatial context through the barcodes. Imaging-based ST technology, on the other hand, uses fluorescence or other imaging techniques to visualize gene expression directly in the tissue context, and can reach cellular and even sub-cellular resolution. However, imaging-based ST technology has a limitation: the number of genes it can measure is restricted, due to the complexity of multiple rounds of fluorescence imaging across many genes. To be more specific, sequencing-based technologies like Spatial Transcriptomics (Ståhl et al., 2016), Visium and VisiumHD (10x Genomics) can measure ∼15k-30k genes (Figure 1b), while imaging-based technologies like MERFISH (Chen et al., 2015) and STARmap (Wang et al., 2018) can only measure hundreds of genes (median number of genes around 300 in the SOAR database (Li et al., 2022)).
Publicly available sources for ST data include Gene Expression Omnibus (GEO), 10x Genomics datasets, Spatial Research datasets, and various publications. We queried the GEO website using the keywords "spatial transcriptomics", specifically targeting supplementary files in JPG, PNG, or TIFF formats. This search resulted in 856 datasets from 121 unique GEO studies. Additionally, we gathered 58 Visium and 4 VisiumHD datasets from 10x Genomics, complementing these with 233 slides manually collected from 10 additional studies (see Appendix A for a full list of references).
A significant challenge in this process was the inconsistent sharing standards for ST data, particularly for the image components. Many datasets lack corresponding images, making it difficult to analyze the gene expression data in its proper spatial context. For Visium data, the standard format typically includes at least one image, which can be full-resolution, high-resolution, or low-resolution. In this work, we used the highest-resolution image available for each dataset. Spatial Transcriptomics data posed additional hurdles. This kind of data requires CytAssist images to map the coordinates to the image, but these images are rarely publicly available, making it challenging to link gene expression data to histopathology images. Only datasets with both mapped and unmapped coordinates could be included in the study, since both are needed to calculate the spot diameter following the ST pipeline in the SpatialTranscriptomicsResearch GitHub repository. Given the various sharing formats and the common absence of key data, it is particularly challenging for researchers unfamiliar with ST to align gene expression data with histopathology images. To address this, we manually processed and verified every dataset to ensure accurate coordinate mapping, allowing precise linking of gene expression data to histopathology images. Furthermore, we calculated and included the corresponding spot radius to indicate the area of measurement. These manual efforts underscore our commitment to providing a reliable and comprehensive dataset, facilitating easier integration of ST data with histopathology images for researchers across various disciplines.
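To illustrate how a spot center and radius can be turned into a sub-tile, the following is a minimal sketch. It assumes pixel coordinates in the frame of the slide image and a square tile that circumscribes the spot; the function name and the example values are hypothetical and are not taken from the dataset files.

```python
from typing import Tuple


def spot_crop_box(cx: float, cy: float, radius: float,
                  width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a square tile around a spot.

    The box is clipped to the image bounds so that spots near the slide
    edge still yield a valid (possibly smaller) tile.
    """
    left = max(0, int(round(cx - radius)))
    top = max(0, int(round(cy - radius)))
    right = min(width, int(round(cx + radius)))
    bottom = min(height, int(round(cy + radius)))
    if right <= left or bottom <= top:
        raise ValueError("spot lies outside the image")
    return left, top, right, bottom


# Hypothetical spot at (120, 80) with a 32-pixel radius on a 2000x1500 image.
box = spot_crop_box(cx=120.0, cy=80.0, radius=32.0, width=2000, height=1500)
print(box)
```

The tuple uses the (left, upper, right, lower) ordering that image libraries such as Pillow accept for cropping, so it could be passed to a crop call directly under that assumption.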
In summary, we systematically collected a diverse collection of 1,149 ST slides, encompassing 4,293,195 spots with paired gene expression information. For each dataset, we provide histopathology images, spot center coordinates and radius, as well as the associated gene expression data. Our STimage-1K4M dataset comprises data from the Spatial Transcriptomics, Visium, and VisiumHD platforms. At the slide level, STimage-1K4M has 13.1% from Spatial Transcriptomics, 86.5% from Visium, and 0.3% from VisiumHD. At the spot level, due to the resolution difference, the composition shifts to 1.4% from Spatial Transcriptomics, 54.4% from Visium, and 44.2% from VisiumHD. STimage-1K4M predominantly includes data from human and mouse and encompasses 50 tissues; the largest proportion of images comes from brain, accounting for 21.8% (251 slides), followed by breast tissue at 17.8% (205 slides). Given the major focus on cancer in the field of ST, it is noteworthy that 39.7% of the slides (456 slides) are from studies related to cancer.
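The slide-level percentages quoted above follow directly from the slide counts and the total of 1,149 slides. The short check below reproduces that arithmetic; the helper name is illustrative only.

```python
TOTAL_SLIDES = 1149  # total number of ST slides reported for STimage-1K4M


def percent_of_slides(count: int, total: int = TOTAL_SLIDES) -> float:
    """Share of slides, in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Slide counts quoted in the text.
counts = {"brain": 251, "breast": 205, "cancer-related": 456}
shares = {name: percent_of_slides(n) for name, n in counts.items()}
print(shares)
```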
In addition to the paired image and gene expression data, we also included pathologist annotations for the slides (Figure 4). Spatial domain detection or clustering is a popular topic in ST data analysis. However, due to the lack of organized datasets, evaluations in most ST clustering methods utilizing image data rely on limited samples (Andersson et al., 2021; Maynard et al., 2021). We manually reviewed relevant publications and extracted annotations from 9 studies including 71 slides to enrich our STimage-1K4M dataset. These pathologist annotations are anticipated to substantially reduce efforts required for collecting labeled data with "ground truth" in the ST field and to provide researchers with a more comprehensive resource for evaluating clustering methods and dimension reduction techniques.
As a comprehensive and meticulously curated dataset, STimage-1K4M aims to facilitate research in ST, computational pathology, and related fields. This dataset can significantly streamline the data collection process, allowing researchers to focus on developing innovative methods and gaining deeper insights into tissue structure and gene expression patterns.

Popular tasks using ST images
Within the field of traditional computational biology with no gene expression involved, commonly performed tasks such as tissue type classification and image-text retrieval have well-established solutions. However, ST introduces new complexity and opportunity with additional gene expression information. ST data allows researchers to engage in a variety of specialized tasks that are particularly suited to the strengths of this new type of technology.
Gene Expression Prediction and Resolution Enhancement. One key usage of images in ST is predicting gene expression (Figure 2a) from histopathology images (Xie et al., 2024; He et al., 2020a). This approach allows researchers to infer gene expression levels from visual data, potentially reducing the need for expensive and time-consuming library preparation and sequencing. Additionally, increasing the resolution of gene expression data through high-quality imaging techniques offers a more detailed understanding of spatial patterns within tissue samples, leading to improved accuracy in analyzing gene expression spatial distributions (Hu et al., 2023; Zhang et al., 2024).
Representation Learning and Clustering. As in image-based computational biology, learning image embeddings is also a popular task in ST (Figure 2b). This process involves transforming high-resolution tissue images into compact, informative representations that capture the essential features of the underlying biological processes. A key application of these embeddings is spatial clustering (Figure 2c), where similar tissue regions are grouped based on shared characteristics captured in the embeddings (Hu et al., 2021). Clustering allows researchers to explore tissue heterogeneity and identify distinct spatial clusters that may correspond to different cellular functions or disease states.
Deconvolution and Cell Segmentation. Deconvolution and cell segmentation are valuable computational methods that enhance our understanding of tissue composition at the cellular level (Figure 2d). Deconvolution specifically focuses on deciphering mixed signals within spot-level gene expression data to accurately estimate the proportions of contributing cell types present in a tissue sample. Histology images are particularly valuable in this context because common staining methods inherently highlight nuclei information, providing a clear visual representation of cellular structures (Biancalani et al., 2021; Chen et al., 2023). This visual clarity aids deconvolution via computational methods, as spots that appear similar in the images are likely to have similar cell type compositions. Additionally, the integration of image analysis with deconvolution facilitates the application of trained models to new images or to areas within images where spots were not initially measured, potentially increasing analysis resolution. Furthermore, by employing cell segmentation techniques alongside these images, researchers can precisely identify and categorize individual nuclei, which allows accurate assignment of specific cell types to these identified nuclei, thereby enriching the gene expression data with detailed cellular annotations (Biancalani et al., 2021; Zhang et al., 2024).

Experiment training with STimage-1K4M
To demonstrate the effectiveness of our STimage-1K4M dataset, we employed contrastive learning to fine-tune the image encoders of pre-trained CLIP and PLIP models using STimage-1K4M, to enhance the models' performance in integrating pathology images with corresponding gene expressions. To effectively incorporate gene expressions, we replaced the text encoder in these models with fully connected neural networks, as shown in Figure 2b and Appendix B. The objective of our contrastive learning remains consistent with the original CLIP framework, aiming to increase the cosine similarity between embeddings of aligned pairs while minimizing similarity for unaligned pairs. Given the challenges of different genes measured across datasets and prevailing batch effects, we limited our analyses to samples from Maynard et al. (2021), which includes 12 human dorsolateral prefrontal cortex (DLPFC) slides encompassing 47,681 spots. To manage the high dimensionality of gene expression data, we explored two strategies: highly variable genes (HVG) selected separately from each slide, and HVGs selected from overlapping genes across slides (overlap-HVG). Once fine-tuned, we conducted experiments for image classification using linear probing and analyzed the image embeddings through t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten and Hinton, 2008). See Appendix for experiment details.
Evaluation using linear probing. We evaluated the performance of the fine-tuned models via linear probing. This involved training a simple linear classifier on 80% of the data, sampled with five different seeds, using the embeddings from both the fine-tuned and zero-shot models (CLIP (Radford et al., 2021), PLIP (Huang et al., 2023), and UNI (Chen et al., 2024)). As shown in Figure 3a, the fine-tuned CLIP and PLIP with HVG achieved higher mean F1 scores than the zero-shot CLIP and PLIP models, indicating that fine-tuning on our STimage-1K4M improves the performance. While we did not fine-tune the larger UNI model due to computational constraints, the results suggest that both fine-tuning with our dataset and using a more effective pre-trained model contribute to better performance. We conjecture that fine-tuning UNI on our STimage-1K4M could further enhance its performance, combining the benefits of both advanced model architecture and tailored training data.
Image representation learning. To evaluate the enhancement in image representations achieved by the fine-tuned models, we utilized pathologist-annotated brain layers (Figure 3c) as benchmarks to calculate several cluster quality metrics (Figure 3b), including the Silhouette score (Rousseeuw, 1987), the Calinski-Harabasz index (Caliński and Harabasz, 1974), and the Davies-Bouldin index (Davies and Bouldin, 1979). Additionally, we applied t-SNE (Van der Maaten and Hinton, 2008) for visualization to further analyze the clustering patterns (Figure 3d). Our findings indicate that, compared to zero-shot embeddings, the fine-tuned embeddings more effectively distinguish between various tissue subtypes, notably between white matter (WM) and other layers (L1-L6) in the brain (Figure 3b,d). In particular, the image embeddings from the fine-tuned models outperform all zero-shot image embeddings. This enhancement suggests that incorporating gene expression data into the training process helps the model capture more nuanced differences within the tissue slides, which highlights the potential of integrating genetic and image information to learn more precise and informative interpretations of tissue structure and function.

Discussion
In this work, we introduced STimage-1K4M, a groundbreaking open-source dataset that pairs histopathology images with gene expression data. Our empirical results demonstrate the effectiveness of pre-training using STimage-1K4M, which has been shown to outperform larger state-of-the-art models such as CLIP and PLIP. This success highlights the significant potential of integrating image and gene expression data to enhance model performance and provide new opportunities for advancing research in spatial transcriptomics and computational pathology. Despite these promising results, this emerging field also presents significant challenges that require innovative approaches to overcome. Next, we discuss the potentials and challenges associated with this integration.
High-dimensional image. A typical histopathology image consists of three primary color channels: red, green, and blue (RGB). These channels represent the standard visualization used in most imaging technologies to capture the visual structure and patterns in tissue samples. When histopathology images are paired with gene expression data, the data dimension increases enormously, presenting both challenges and opportunities for analysis. If gene expression data is treated as a separate set of "channels", where each gene's expression level is represented as a gray-scale image channel, the entire histopathology image transforms into a high-dimensional data structure. Instead of having just three RGB channels, the transformed image would now have roughly 20,000 channels, each representing the expression of a different gene. This high dimensionality adds molecular information to the visual data, offering insights far beyond what can be revealed by staining methods.
Given this expanded data structure, a crucial question arises: How can this high-dimensional data be effectively analyzed and utilized? One of the central challenges is to strike an optimal balance between sample size and resolution. When focusing on spot-level images, there is a risk of losing spatial connections between the spots. On the other hand, slide-level information provides a broader context but at the cost of reduced sample size, which could limit the scope of analysis. This challenge leads to further questions: How can slide-level information improve the image embeddings for spot-level images? Can the data be augmented by pairing it with datasets containing image-text or purely image-based information?
For spot-level data, several approaches have been attempted, including contrastive SSL (Long et al., 2023; Yao et al., 2024), but these approaches typically concentrate on spot-level images from a single slide, limiting their generalizability. Determining the optimal approach for analyzing multi-dimensional datasets remains an open question. Should researchers employ contrastive SSL, where models learn from paired image-gene expression data, or treat the dataset as a multi-channel image, where traditional image-processing techniques can be applied? These questions are central to the ongoing evolution of computational pathology, as they determine the effectiveness of latent embedding extraction and ultimately influence models' performance in real-world applications.
Position encoding. In traditional vision transformer models (Dosovitskiy et al., 2020), positional encoding is used to provide context about the relative or absolute positions of input image patches within a sequence. This is crucial because transformers, unlike convolutional neural networks, do not inherently retain information about the order or spatial arrangement of their inputs. Positional encoding typically involves adding a set of coordinates or numerical values to the model's inputs, enabling the model to understand spatial relationships and preserve structure during analysis. In the context of integrating histopathology images with gene expression data, gene expression data could potentially serve as a unique form of positional encoding. By linking specific regions within an image to their corresponding transcriptomic information, researchers can create spatially-aware models that can learn from both visual and transcriptomic cues.

Gene expression annotation.
Gene expression data has become an indispensable resource in the annotation of complex biological datasets, offering insights into molecular and cellular activity as well as underlying mechanisms of various biological processes. It has been widely used for various downstream analyses, including clustering, which classifies cells/spots into distinct groups based on their gene expression profiles, and deconvolution, which estimates the composition of cell types in a spot. Recent advancements include integrating large language models (LLMs) to extract meaningful gene expression embeddings (Chen et al., 2023; Schaar et al., 2024), which utilize gene names and text descriptions to enhance data interpretation. However, several significant challenges remain. Variations in genome structure across different species complicate cross-species analysis, and batch effects introduce systematic biases in gene expression measurements. Additionally, using gene names with rankings (Chen and Zou, 2023; Schaar et al., 2024) lacks the precision of quantitative values, and high dimensionality necessitates effective dimension reduction techniques. Current methods, such as using PCs or HVGs, often fall short in multi-slide analysis across different tissues and species.
Our STimage-1K4M dataset has the potential to address these challenges by providing a large, diverse collection of paired histopathology images and gene expression data across multiple species and tissue types. This dataset may facilitate the development of robust annotation methods that manage high-dimensional data and mitigate batch effects.
Limitations. In this work, although we have shown that the integration of gene expression data has enhanced the performance of the pre-trained CLIP and PLIP image encoders, the fine-tuned models are still inferior to the UNI model, suggesting that employing a more powerful foundational model could potentially yield further improvements. However, due to limited computational resources, we were unable to fine-tune the UNI model. Additionally, we only utilized a maximum of 128 dimensions, compressed into a 32-dimensional latent layer, to incorporate gene expression data. This simplistic implementation may not fully capture the complexity and richness of gene expression information. Efforts to fine-tune the models using data from other tissue types resulted in suboptimal performance. This suggests that batch effects across datasets introduce noise and variability, significantly impacting model performance.

B CLIP and PLIP finetuning details
To evaluate the potential of STimage-1K4M, we fine-tuned the image encoder part of the CLIP (Radford et al., 2021) and PLIP (Huang et al., 2023) models. All model implementations are built upon the training code (issue #83) posted in the CLIP GitHub repository. The hyperparameters are chosen to be the same as in CLIP training. All the parameters are converted to fp32. For CLIP, we loaded the pretrained parameters (ViT-B/32) from openai/clip-vit-base-patch32 on Hugging Face. For PLIP, we loaded the pretrained parameters (ViT-L/14) from vinid/plip on Hugging Face. In our model architecture, the image encoder feeds into a fully connected layer that reduces its output to a 32-dimensional latent space. Similarly, the gene expression encoder consists of a single fully connected layer that compresses its high-dimensional input down to a 32-dimensional embedding. These 32-dimensional representations from both the image and gene expression encoders are then used as the basis for our contrastive loss function. We fine-tuned the models for 15 epochs. All experiments were performed on a single NVIDIA A100 GPU.
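The contrastive objective described above can be sketched as a symmetric cross-entropy over cosine similarities between image and gene expression embeddings. The code below is a dependency-free illustration and not the training code used in this work: the function names are hypothetical, the inputs stand in for the 32-dimensional outputs of the two fully connected layers, and the fixed temperature of 0.07 is an assumption (CLIP learns this scale during training).

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_emb: Sequence[Vector], gene_emb: Sequence[Vector],
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; row i of each input is one aligned pair."""
    n = len(image_emb)
    img = [l2_normalize(v) for v in image_emb]
    gen = [l2_normalize(v) for v in gene_emb]
    # logits[i][j]: scaled cosine similarity between image i and gene profile j.
    logits = [[sum(a * b for a, b in zip(img[i], gen[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-gene direction: the matching gene profile is the target class.
    loss_i2g = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Gene-to-image direction uses the transposed similarity matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_g2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2g + loss_g2i)


# Toy 2-dimensional embeddings: aligned pairs versus swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

Aligned pairs produce a loss close to zero, while swapping the gene profiles drives the loss up, which is the behavior the objective is meant to reward and penalize.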
To compare the choice of gene sets, we employed two methods.
1. HVG. For each dataset, we selected the top 128 HVGs. Then the HVGs of all datasets are sorted by highly variable rank and combined regardless of gene names as training data for fine-tuning. The HVGs are selected using the Python scanpy package, scanpy.pp.highly_variable_genes with default settings (Lause et al., 2021; Wolf et al., 2018).
2. Overlap HVG. For each training dataset, we first combined the gene expression with respect to gene names and kept only the overlapping genes. Then we selected the top 100 HVGs from the combined gene expression as training data. The HVGs are selected using the Python scanpy package, scanpy.experimental.pp.highly_variable_genes with default settings (Satija et al., 2015; Wolf et al., 2018).
As discussed in Section 6, the gene names of different species do not overlap. Thus, in order to use combined gene expression from multiple datasets, we subset STimage-1K4M by species. In this study, we fine-tuned models using data from Maynard et al. (2021) (human brain) with different gene sets.
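The difference between the two gene-set strategies can be sketched as follows. This is a simplified illustration and not the scanpy procedure used in this work: genes are ranked by plain variance as a stand-in for scanpy's HVG scoring, and the data layout, function names, and toy values are all hypothetical.

```python
from statistics import pvariance
from typing import Dict, List

# Hypothetical layout: gene name -> expression values across the spots of one slide.
Expr = Dict[str, List[float]]


def top_variable_genes(expr: Expr, k: int) -> List[str]:
    """Rank genes by plain variance (a stand-in for scanpy's HVG scoring)."""
    return sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)[:k]


def per_slide_hvg(slides: List[Expr], k: int) -> List[List[str]]:
    """Strategy 1: pick k genes per slide; inputs align by rank, not by name."""
    return [top_variable_genes(s, k) for s in slides]


def overlap_hvg(slides: List[Expr], k: int) -> List[str]:
    """Strategy 2: keep genes shared by all slides, pool spots, then pick k."""
    shared = set(slides[0]).intersection(*slides[1:])
    pooled = {g: [x for s in slides for x in s[g]] for g in sorted(shared)}
    return top_variable_genes(pooled, k)


# Toy slides with made-up values; GENE3 and GENE4 are each measured in one slide only.
slide_a = {"GENE1": [0, 5, 10], "GENE2": [1, 1, 2], "GENE3": [3, 3, 3]}
slide_b = {"GENE1": [2, 2, 3], "GENE2": [0, 4, 8], "GENE4": [9, 0, 9]}

print(per_slide_hvg([slide_a, slide_b], k=1))
print(overlap_hvg([slide_a, slide_b], k=1))
```

In the per-slide strategy the selected genes can differ between slides, so the same input position may carry different genes, whereas the overlap strategy yields a single gene list shared by all slides.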

C Linear probing details
For linear probing, we follow the procedure in Huang et al. (2023). We employ the stochastic gradient descent classifier (SGDClassifier) module from the sklearn (Pedregosa et al., 2011) Python package as a logistic regression classifier. For fine-tuned models, we used 32-dimensional image embeddings. For zero-shot models, we used 512-dimensional embeddings. The performance of the linear classifier trained using L2 regularization with different regularization multipliers (α = 1, 0.1, 0.01, 0.001, 0.0001) was evaluated on the validation splits for all models (training:validation:test = 8:1:1). The best-performing linear classifier was selected based on the average macro F1 performance from the results trained on training splits sampled by five different random seeds (seed = 1, 2, 3, 4, 5).

D t-SNE details
We applied t-SNE (Van der Maaten and Hinton, 2008) to each slide individually using the sklearn.manifold.TSNE function from the sklearn (Pedregosa et al., 2011) Python package. For the fine-tuned models, we utilized 32-dimensional embeddings, whereas for the zero-shot models, we employed 512-dimensional embeddings. We also used this setting in our calculations of the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores using the sklearn (Pedregosa et al., 2011) Python package.
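The per-slide analysis can be sketched as follows for a single synthetic slide. The Gaussian clusters stand in for real embeddings and pathologist annotations, and the t-SNE perplexity and random seed are assumptions. Computing the three scores on the embeddings with the annotation labels is our reading of the description above, not a confirmed detail.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(0)

# Synthetic slide: 3 annotated regions, 40 spots each, 32-dimensional embeddings.
n_per_region, dim = 40, 32
centers = rng.normal(scale=5.0, size=(3, dim))
embeddings = np.vstack([c + rng.normal(size=(n_per_region, dim)) for c in centers])
labels = np.repeat(np.arange(3), n_per_region)

# Two-dimensional t-SNE coordinates for visualization of one slide.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

# Cluster-quality scores of the embeddings with respect to the annotations.
scores = {
    "silhouette": silhouette_score(embeddings, labels),                # higher is better
    "calinski_harabasz": calinski_harabasz_score(embeddings, labels),  # higher is better
    "davies_bouldin": davies_bouldin_score(embeddings, labels),        # lower is better
}
print(coords.shape, scores)
```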

Composition
• What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? This dataset includes histopathology images, spatial coordinates for spots, and the paired gene expressions.
• How many instances are there in total (of each type, if appropriate)? STimage-1K4M includes 1,149 whole-slide images and 4,293,195 spots (sub-tiles), with the expression of 15,000-30,000 genes associated with each spot.
• Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? The STimage-1K4M dataset is not a subset of a larger collection but rather an extensive compilation of all available instances we could identify and collect. We made a comprehensive effort to collect as much data as possible, specifically targeting datasets from the Gene Expression Omnibus (GEO) that contain both pathology images and gene expression data. While it may not cover every possible instance due to the inherent limitations of data availability and access, it represents the most exhaustive collection of such data currently available. We are committed to updating the dataset as more data or technologies become available.
• What data does each instance consist of? We consider each spot as an instance, which has high-dimensional gene expression data, image data, and spot coordinate data.
• Is there a label or target associated with each instance? Yes, the gene expression data could be treated as the label for each image.
• Is any information missing from individual instances? We provide extra information such as the abstract and paper title. This information is missing for datasets without a valid publication ID.
• Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? Yes, all instances of the same slide/dataset and their spatial relationships can be analyzed using the spatial coordinate file.
• Are there recommended data splits (e.g., training, development/validation, testing)? There are no recommended data splits, but the data could potentially be split by tissue type.
• Are there any errors, sources of noise, or redundancies in the dataset? While the STimage-1K4M dataset is carefully curated, the source ST data may contain inherent noise and errors due to the limitations of the technology used. We are not aware of any redundancies in the dataset.
• Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? The dataset is self-contained.
• Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? All the data in STimage-1K4M are from publicly available sources, and we are not aware of any such confidential information. We collected the data from publicly available sources with no contact with individuals involved in the studies.
• Did the individuals in question consent to the collection and use of their data? Not applicable; we collected the data from publicly available sources without contact with individuals involved in the studies. We cite all the studies included in the dataset.
• If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? Not applicable; we collected the data from publicly available sources.
• Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? Not applicable; we collected the data from publicly available sources.

Preprocessing/cleaning/labeling
• Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? Yes, metadata such as tissue type were cleaned manually. All code related to label cleaning is available in the GitHub repository https://github.com/JiawenChenn/STimage-1K4M.
• Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? Yes, all raw labels were saved.
• Is the software that was used to preprocess/clean/label the data available? Yes, all code related to label cleaning is available in the same GitHub repository.

5. Uses
• Has the dataset been used for any tasks already? No.
• Is there a repository that links to any or all papers or systems that use the dataset? Yes, all sources are available at https://github.com/JiawenChenn/STimage-1K4M.
• What (other) tasks could the dataset be used for? The dataset could be used for training self-supervised models to better understand histopathology and gene expression.
• Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? Yes, we would recommend our format as the data format for future ST data releases.
• Are there tasks for which the dataset should not be used? We are not aware of any such task.

6. Distribution
• Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? Yes, all the data are distributed under a permissible license for research-based use.
• How will the dataset be distributed (e.g., tarball on website, API, GitHub)? The dataset is distributed via an FTP server.
• When will the dataset be distributed? The dataset is released with a permissible license for research-based use.
• Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? The use of the data will be under a permissible license for research-based use.
• Have any third parties imposed IP-based or other restrictions on the data associated with the instances? We are not aware of such restrictions.
• Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? We are not aware of such restrictions.

7. Maintenance
• Who will be supporting/hosting/maintaining the dataset? The first author of the paper.
• How can the owner/curator/manager of the dataset be contacted (e.g., email address)? The first author and corresponding authors can be contacted using the email addresses listed in the paper or through GitHub.
• Is there an erratum? No.
• Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? Yes, the dataset will be updated periodically to ensure data quality. We are also committed to continually expanding the dataset by adding new samples as they become available.
• If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? We are not aware of such limits.
• Will older versions of the dataset continue to be supported/hosted/maintained? All versions of the dataset will be available.
• If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? Not at this time.

F Additional dataset information
STimage-1K4M is released at https://github.com/JiawenChenn/STimage-1K4M, with the metadata record also contained in this repository. The data are released under a permissible license for research-based use, which is described in detail in the data request form at https://forms.gle/3Waa4FQnqpK8UGSY7. All code related to this project is under the MIT license.

G Author Statement
The authors of the STimage-1K4M dataset bear full responsibility for the content and compliance of this project. All authors of this paper have confirmed the data license. The data is now hosted on an FTP server and will be maintained by the first author of this paper.

Figure 2: Popular tasks in ST data analysis.

Figure 3: Evaluation results. (a) Linear probing results, denoted by average macro F1 (error bars indicate standard deviations). (b) Silhouette, Calinski-Harabasz and Davies-Bouldin scores for image embeddings. (c) Histopathology image of brain sample 151675 colored by pathologist annotation. (d) t-SNE embeddings of sample 151675, colored by the same layer annotations as in (c).

Figure 4: Datasets with pathologist annotation. The points are colored by annotation in each dataset. The legend for the mouse brain data (bottom right) is omitted for clarity of visualization.
• Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? This dataset was curated by Dr. Yun Li and Dr. Didong Li's group on behalf of University of North Carolina at Chapel

• Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? No.
• Does the dataset identify any subpopulations (e.g., by age, gender)? Not explicitly.
• Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? No.
• Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? No.
• How was the data associated with each instance acquired? We queried the GEO website using the keywords "spatial transcriptomics", specifically targeting supplementary files in JPG, PNG, or TIFF formats. This search resulted in 856 datasets from 121 unique GEO studies. Additionally, we gathered 58 Visium and 4 VisiumHD datasets from 10X Genomics, complementing these with 233 slides manually collected from 10 additional studies.
• What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? We used the rvest R package to gather download links for datasets in GEO. For other datasets, data were collected by manual curation.
• If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Not applicable; this dataset is not a sample from a larger set.
• Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? Jiawen Chen, Wenrong Wu and Jinwei Zhang were involved in the data collection process. All of them are graduate students.
• Over what timeframe was the data collected? STimage-1K4M includes data generated from 2016-2024.
• Were any ethical review processes conducted (e.g., by an institutional review board)? No official ethical review processes were conducted.
• Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? The data were collected from websites.
• Were the individuals in question notified about the data collection? Not applicable,